A Model for Segment-Based Speech Recognition
Author
Abstract
Currently, most approaches to speech recognition are frame-based in that they represent the speech signal using a temporal sequence of frame-based features, such as Mel-cepstral vectors. Frame-based approaches take advantage of efficient search algorithms that contribute largely to their success. However, they cannot easily incorporate segment-based modeling strategies that can further improve recognition performance. For example, duration is a segment-based feature that is useful but difficult to model in a frame-based approach. In contrast, segment-based approaches represent the speech signal using a graph of segment-based features, such as average Mel-cepstral vectors over hypothesized phone segments. Segment-based approaches enable the use of segment-based modeling strategies. However, they introduce multiple difficulties in recognition that have limited their success.
In this work, we have developed a framework for speech recognition that overcomes many of the difficulties of a segment-based approach. We have published experiments in phone recognition on the core test set of the TIMIT corpus over 39 classes [1]. We have also run preliminary experiments in word recognition on the December '94 test set of the ATIS corpus.
In our segment-based approach, we hypothesize segments prior to recognition. Previously, our segmentation algorithm was based on local acoustic change. However, segmentation depends on contextual factors that are difficult to capture in a simple measure. We have developed a probabilistic segmentation algorithm called "segmentation by recognition" that hypothesizes segments in the process of recognition. Segmentation by recognition applies all of the constraints used in recognition towards segmentation. As a result, it hypothesizes more accurate segments. In addition, it adapts to all types of variability, focuses modeling on confusable segments, hypothesizes all types of units, and uses scores that can be re-used in recognition. We have implemented this segmentation algorithm using a backwards A* search and a diphone context-dependent frame-based phone recognizer. In published TIMIT experiments, we have reported an 11.3% reduction in phone recognition error rate from 38.7% with our previous acoustic segmentation to 34.3% with segmentation by recognition [1].
In segment-based recognition, the speech signal is represented using a graph of features. Probabilistically, it is necessary to account for all of the features in the graph. However, each path through the graph directly accounts for only a subset of all features. Previously, we modeled the features that are not in a path using a single "anti-phone" model [2]. However, the features that are not in a path depend …
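The contrast the abstract draws between a temporal sequence of frame-based features and a graph of segment-based features can be illustrated with a small sketch. It is not the authors' implementation: the function names, the 10 ms frame shift, the `max_span` limit, and the toy two-dimensional "frames" are all assumptions made for illustration. Each edge of the graph is a hypothesized segment, described by the average of its frame vectors plus its duration, the segment-level feature the abstract notes is hard to model in a frame-based approach.

```python
from statistics import fmean


def segment_features(frames, start, end, frame_shift_s=0.01):
    """Return the per-dimension mean of frames[start:end] followed by the duration.

    frames is a list of equal-length feature vectors (one per frame);
    frame_shift_s is an assumed frame shift in seconds.
    """
    if not 0 <= start < end <= len(frames):
        raise ValueError("segment must be a non-empty span inside the utterance")
    span = frames[start:end]
    mean_vec = [fmean(column) for column in zip(*span)]
    duration = (end - start) * frame_shift_s
    return mean_vec + [duration]


def segment_graph(frames, boundaries, max_span=2):
    """Build a dict mapping each hypothesized (start, end) segment to its features.

    boundaries is a sorted list of hypothesized boundary frame indices; every
    boundary is connected to at most max_span later boundaries, so alternative
    segmentations of the same stretch of speech coexist in the graph.
    """
    graph = {}
    for i, start in enumerate(boundaries):
        for end in boundaries[i + 1 : i + 1 + max_span]:
            graph[(start, end)] = segment_features(frames, start, end)
    return graph


# Toy utterance: six frames of two-dimensional features (hypothetical values).
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
graph = segment_graph(frames, boundaries=[0, 2, 5, 6])
for edge, features in sorted(graph.items()):
    print(edge, features)
```

Any single path through this graph, for example (0, 2), (2, 5), (5, 6), uses only some of the edges, which is the situation the abstract describes when it says each path directly accounts for only a subset of all features.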
Similar resources
Speaker Adaptation in Continuous Speech Recognition Using MLLR-Based MAP Estimation
A variety of methods are used for speaker adaptation in speech recognition. In some techniques, such as MAP estimation, only the models with available training data are updated. Hence, large amounts of training data are required in order to have significant recognition improvements. In some others, such as MLLR, where several general transformations are applied to model clusters, the results ar...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
Allophone-based acoustic modeling for Persian phoneme recognition
Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...
Recognizing the Emotional State Changes in Human Utterance by a Learning Statistical Method based on Gaussian Mixture Model
Speech is one of the most opulent and instant methods to express emotional characteristics of human beings, which conveys the cognitive and semantic concepts among humans. In this study, a statistical-based method for emotional recognition of speech signals is proposed, and a learning approach is introduced, which is based on the statistical model to classify internal feelings of the utterance....
Persian Phone Recognition Using Acoustic Landmarks and Neural Network-based variability compensation methods
Speech recognition is a subfield of artificial intelligence that develops technologies to convert speech utterance into transcription. So far, various methods such as hidden Markov models and artificial neural networks have been used to develop speech recognition systems. In most of these systems, the speech signal frames are processed uniformly, while the information is not evenly distributed ...
Journal title:
Volume, issue:
Pages: -
Publication date: 1999